In this paper, we learn a diffusion model to generate 3D data on a scene-scale. Specifically, our model crafts a 3D scene consisting of multiple objects, while recent diffusion research has focused on a single object. To realize our goal, we represent a scene with discrete class labels, i.e., categorical distribution, to assign multiple objects into semantic categories. Thus, we extend discrete diffusion models to learn scene-scale categorical distributions. In addition, we validate that a latent diffusion model can reduce computation costs for training and deploying. To the best of our knowledge, our work is the first to apply discrete and latent diffusion for 3D categorical data on a scene-scale. We further propose to perform semantic scene completion (SSC) by learning a conditional distribution using our diffusion model, where the condition is a partial observation in a sparse point cloud. In experiments, we empirically show that our diffusion models not only generate reasonable scenes, but also perform the scene completion task better than a discriminative model. Our code and models are available at https://github.com/zoomin-lee/scene-scale-diffusion
translated by 谷歌翻译
具有提高可传递性的对抗性攻击 - 在已知模型上精心制作的对抗性示例的能力也欺骗了未知模型 - 由于其实用性,最近受到了很多关注。然而,现有的可转移攻击以确定性的方式制作扰动,并且常常无法完全探索损失表面,从而陷入了贫穷的当地最佳最佳效果,并且遭受了低传递性的折磨。为了解决这个问题,我们提出了细心多样性攻击(ADA),该攻击以随机方式破坏了不同的显着特征以提高可转移性。首先,我们将图像注意力扰动到破坏不同模型共享的通用特征。然后,为了有效避免局部优势差,我们以随机方式破坏了这些功能,并更加详尽地探索可转移扰动的搜索空间。更具体地说,我们使用发电机来产生对抗性扰动,每个扰动都根据输入潜在代码而以不同的方式打扰。广泛的实验评估证明了我们方法的有效性,优于最先进方法的可转移性。代码可在https://github.com/wkim97/ada上找到。
translated by 谷歌翻译
光流CNNS的训练管道由合成数据集的预处理阶段组成,然后在目标数据集上进行微调阶段。但是,从目标视频中获得地面真理需要巨大的努力。本文提出了一种实用的微调方法,将预处理的模型调整到没有地面真相流的目标数据集中,但尚未进行广泛探讨。具体而言,我们为自我划分的流程主管提出了一个流程主管,其中包括参数分离和学生量连接。该设计的目的是稳定的收敛性和更好的准确性,而在微调任务上是不稳定的传统自我实施方法。实验结果表明,与半监督学习的不同自学方法相比,我们方法的有效性。此外,我们通过利用其他未标记的数据集来实现对Sintel和Kitti基准测试的最先进的光流模型的有意义的改进。代码可在https://github.com/iwbn/flow-supervisor上找到。
translated by 谷歌翻译
The deep neural network (DNN) models for object detection using camera images are widely adopted in autonomous vehicles. However, DNN models are shown to be susceptible to adversarial image perturbations. In the existing methods of generating the adversarial image perturbations, optimizations take each incoming image frame as the decision variable to generate an image perturbation. Therefore, given a new image, the typically computationally-expensive optimization needs to start over as there is no learning between the independent optimizations. Very few approaches have been developed for attacking online image streams while considering the underlying physical dynamics of autonomous vehicles, their mission, and the environment. We propose a multi-level stochastic optimization framework that monitors an attacker's capability of generating the adversarial perturbations. Based on this capability level, a binary decision attack/not attack is introduced to enhance the effectiveness of the attacker. We evaluate our proposed multi-level image attack framework using simulations for vision-guided autonomous vehicles and actual tests with a small indoor drone in an office environment. The results show our method's capability to generate the image attack in real-time while monitoring when the attacker is proficient given state estimates.
translated by 谷歌翻译
Iris segmentation is the initial step to identify biometric of animals to establish a traceability system of livestock. In this study, we propose a novel deep learning framework for pixel-wise segmentation with minimum use of annotation labels using BovineAAEyes80 public dataset. In the experiment, U-Net with VGG16 backbone was selected as the best combination of encoder and decoder model, demonstrating a 99.50% accuracy and a 98.35% Dice coefficient score. Remarkably, the selected model accurately segmented corrupted images even without proper annotation data. This study contributes to the advancement of the iris segmentation and the development of a reliable DNNs training framework.
translated by 谷歌翻译
We study grammar induction with mildly context-sensitive grammars for unsupervised discontinuous parsing. Using the probabilistic linear context-free rewriting system (LCFRS) formalism, our approach fixes the rule structure in advance and focuses on parameter learning with maximum likelihood. To reduce the computational complexity of both parsing and parameter estimation, we restrict the grammar formalism to LCFRS-2 (i.e., binary LCFRS with fan-out two) and further discard rules that require O(n^6) time to parse, reducing inference to O(n^5). We find that using a large number of nonterminals is beneficial and thus make use of tensor decomposition-based rank-space dynamic programming with an embedding-based parameterization of rule probabilities to scale up the number of nonterminals. Experiments on German and Dutch show that our approach is able to induce linguistically meaningful trees with continuous and discontinuous structures
translated by 谷歌翻译
This work proposes a framework developed to generalize Critical Heat Flux (CHF) detection classification models using an Unsupervised Image-to-Image (UI2I) translation model. The framework enables a typical classification model that was trained and tested on boiling images from domain A to predict boiling images coming from domain B that was never seen by the classification model. This is done by using the UI2I model to transform the domain B images to look like domain A images that the classification model is familiar with. Although CNN was used as the classification model and Fixed-Point GAN (FP-GAN) was used as the UI2I model, the framework is model agnostic. Meaning, that the framework can generalize any image classification model type, making it applicable to a variety of similar applications and not limited to the boiling crisis detection problem. It also means that the more the UI2I models advance, the better the performance of the framework.
translated by 谷歌翻译
Human organs constantly undergo anatomical changes due to a complex mix of short-term (e.g., heartbeat) and long-term (e.g., aging) factors. Evidently, prior knowledge of these factors will be beneficial when modeling their future state, i.e., via image generation. However, most of the medical image generation tasks only rely on the input from a single image, thus ignoring the sequential dependency even when longitudinal data is available. Sequence-aware deep generative models, where model input is a sequence of ordered and timestamped images, are still underexplored in the medical imaging domain that is featured by several unique challenges: 1) Sequences with various lengths; 2) Missing data or frame, and 3) High dimensionality. To this end, we propose a sequence-aware diffusion model (SADM) for the generation of longitudinal medical images. Recently, diffusion models have shown promising results on high-fidelity image generation. Our method extends this new technique by introducing a sequence-aware transformer as the conditional module in a diffusion model. The novel design enables learning longitudinal dependency even with missing data during training and allows autoregressive generation of a sequence of images during inference. Our extensive experiments on 3D longitudinal medical images demonstrate the effectiveness of SADM compared with baselines and alternative methods.
translated by 谷歌翻译
Word Sense Disambiguation (WSD) is an NLP task aimed at determining the correct sense of a word in a sentence from discrete sense choices. Although current systems have attained unprecedented performances for such tasks, the nonuniform distribution of word senses during training generally results in systems performing poorly on rare senses. To this end, we consider data augmentation to increase the frequency of these least frequent senses (LFS) to reduce the distributional bias of senses during training. We propose Sense-Maintained Sentence Mixup (SMSMix), a novel word-level mixup method that maintains the sense of a target word. SMSMix smoothly blends two sentences using mask prediction while preserving the relevant span determined by saliency scores to maintain a specific word's sense. To the best of our knowledge, this is the first attempt to apply mixup in NLP while preserving the meaning of a specific word. With extensive experiments, we validate that our augmentation method can effectively give more information about rare senses during training with maintained target sense label.
translated by 谷歌翻译
Video-grounded Dialogue (VGD) aims to decode an answer sentence to a question regarding a given video and dialogue context. Despite the recent success of multi-modal reasoning to generate answer sentences, existing dialogue systems still suffer from a text hallucination problem, which denotes indiscriminate text-copying from input texts without an understanding of the question. This is due to learning spurious correlations from the fact that answer sentences in the dataset usually include the words of input texts, thus the VGD system excessively relies on copying words from input texts by hoping those words to overlap with ground-truth texts. Hence, we design Text Hallucination Mitigating (THAM) framework, which incorporates Text Hallucination Regularization (THR) loss derived from the proposed information-theoretic text hallucination measurement approach. Applying THAM with current dialogue systems validates the effectiveness on VGD benchmarks (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows enhanced interpretability.
translated by 谷歌翻译